[% setvar title Regex modifier for support of chunk processing and prefix matching %]
<div id="archive-notice">
    <h3>This file is part of the Perl 6 Archive</h3>
    <p>To see what is currently happening visit <a href="http://www.perl6.org/">http://www.perl6.org/</a></p>
</div>
<div class='pod'>
<a name='TITLE'></a><h1>TITLE</h1>
<p>Regex modifier for support of chunk processing and prefix matching</p>
<a name='VERSION'></a><h1>VERSION</h1>
<pre>  Maintainer: Bart Lateur &lt;<a href='mailto:bart.lateur@skynet.be'>bart.lateur@skynet.be</a>&gt;
  Date: 23 Sep 2000
  Mailing List: <a href='mailto:perl6-language-regex@perl.org'>perl6-language-regex@perl.org</a>
  Number: 316
  Version: 1
  Status: Developing</pre>
<a name='ABSTRACT'></a><h1>ABSTRACT</h1>
<p>This RFC proposes a new regex modifier that makes it possible to
reliably recognize matches that span multiple records. In addition, it
can tell whether a string is, or ends with, text that could be the
start of a matching string.</p>
<a name='DESCRIPTION'></a><h1>DESCRIPTION</h1>
<p>Currently, string processing happens using one of two approaches:</p>
<ul>
<li><a name='on a line by line (record by record) basis'></a>on a line by line (record by record) basis</li>
<li><a name='the whole string under test is loaded in memory'></a>the whole string under test is loaded in memory</li>
</ul>
<p>Sometimes, you are not able to reliably find a way to split up the
input so that no match could possibly span multiple input records. But
at the same time, the input file is too big to be loaded in memory in
one piece. What to do?</p>
<p>If you just split the data into chunks of arbitrary length and process
them independently, you may have the problem that some matching strings
aren't recognized, or worse still, that the wrong kind of match is found
instead.</p>
<a name='Example'></a><h2>Example</h2>
<p>You want to extract occurrences of the string 'abcd', or, as a second
choice, 'bc', using the regex:</p>
<pre>    /abcd|bc/
    </pre>
<p>You process the input in chunks of 1k each. An occurrence of the
string 'abcd' may then be split across two chunks. Depending on where
the split falls, one of the following will happen:</p>
<pre>    end of  start of    /abcd|bc/
    first   second      will allow
    chunk   chunk       you to find:
    ------  --------    ------------
             abcd        abcd
     a       bcd         bc
     ab      cd         (no match)
     abc     d           bc
     abcd                abcd</pre>
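<p>The table can be reproduced in current Perl with a short sketch. The
helper name <code>matches_per_chunk</code> and the surrounding filler
characters are assumptions made for illustration only; the pattern is
the one used above.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: match each chunk independently, as a naive
# chunk-by-chunk reader would, and collect everything that was found.
sub matches_per_chunk {
    my @chunks = @_;
    my @found;
    for my $chunk (@chunks) {
        push @found, $1 while $chunk =~ /(abcd|bc)/g;
    }
    return @found;
}

# The five ways "abcd" can fall across a chunk boundary, as in the table.
my @splits = (
    [ '',     'abcd' ],
    [ 'a',    'bcd'  ],
    [ 'ab',   'cd'   ],
    [ 'abc',  'd'    ],
    [ 'abcd', ''     ],
);

for my $split (@splits) {
    my ($first, $second) = @$split;
    # "xx" and "yy" stand in for unrelated data around the match.
    my @found = matches_per_chunk("xx$first", "${second}yy");
    printf "%-6s %-6s %s\n", $first, $second, @found ? "@found" : '(no match)';
}
```

<p>Only the first and last rows find the intended 'abcd'; the others
either find the second-choice 'bc' or nothing at all.</p>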
<a name='Remedy'></a><h2>Remedy</h2>
<p>My solution is to still use the same regex, but with an extra modifier.
This modifier makes the regex engine abort its search as soon as it
reaches the end of the string in the middle of a match attempt. In the
above example, if the buffer ends in &quot;abc&quot;, the engine would not go on
to try the alternative &quot;bc&quot;.</p>
<p>For this modifier, I chose the letter 'z', for its mnemonic relationship
with &quot;\z&quot;, the regex marker for end-of-string.</p>
<p>In addition, pos() is set to the offset of the start of the recognized
match prefix. After a plain successful match, or after a normal
not-found termination, pos() is undef on exit.</p>
<p>This serves as a flag, since pos() will only be defined if the search
has been aborted for this reason. It also allows more optimized
searching: after you have appended the next chunk to the current one,
the next try will simply start again at the position where the pattern
may first match, skipping any earlier matches. In the above example,
that would be at the &quot;a&quot;.</p>
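<p>Since the '/z' modifier does not exist, the behaviour described here
can only be approximated. The following is a minimal sketch for the
special case of literal alternatives tried in order, as in /abcd|bc/.
The function <code>scan_z</code> and its return convention are
hypothetical and are not part of Perl or of this proposal.</p>

```perl
use strict;
use warnings;

# Hypothetical emulation of the proposed /z behaviour for a list of
# literal alternatives, tried in order at each position.
# Returns (arrayref of complete matches, offset of the incomplete match),
# where the offset is undef if the buffer did not end inside a match attempt.
sub scan_z {
    my ($buffer, @alternatives) = @_;
    my @matches;
    my $len   = length $buffer;
    my $start = 0;
  START:
    while ($len > $start) {
        for my $alt (@alternatives) {
            if (substr($buffer, $start, length $alt) eq $alt) {
                # Complete match: record it and continue right after it.
                push @matches, $alt;
                $start += length $alt;
                next START;
            }
            my $rest = substr $buffer, $start;
            if (length($alt) > length($rest) and substr($alt, 0, length $rest) eq $rest) {
                # The buffer ended while this alternative was still matching:
                # abort at once, without trying the remaining alternatives.
                return (\@matches, $start);
            }
        }
        $start++;
    }
    return (\@matches, undef);
}

for my $buffer ('xxabcdyy', 'xxabc', 'xxb') {
    my ($matches, $pos) = scan_z($buffer, 'abcd', 'bc');
    printf "%-9s matches: %-5s pos: %s\n",
        $buffer, "@$matches", defined $pos ? $pos : 'undef';
}
```

<p>For the buffer 'xxabc' the sketch reports no match and an offset of 2,
the position of the &quot;a&quot;, instead of settling for 'bc'.</p>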
<p>Exception-happy people would most likely implement it using exceptions,
and throw an exception when the end of the buffer is prematurely
reached. But I'm not like that. Reaching the end of a string is not an
exception.</p>
<a name='Application'></a><h2>Application</h2>
<p>The following code can do the task described, provided that matches are
rare enough so that the buffer probably won't grow too big:</p>
<pre>    my $chunksize = 1024;
    while(read FH, my $buffer, $chunksize) {
        # match against the buffer, not against $_
        while($buffer =~ /(abcd|bc)/gz) {
            # do something boring with the matched string:
            print &quot;$1\n&quot;;
        }
        if(defined pos $buffer) {  # end of buffer reached inside a possible match
            # append the next chunk to the current one and retry matching,
            # but only if more input actually arrived (avoids looping at EOF)
            redo if read FH, $buffer, $chunksize, length $buffer;
        }
    }

    </pre>
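<p>For comparison, here is a sketch of what the same task takes in
current Perl without '/z'. It uses a different technique: a second,
hand-written pattern that lists every proper prefix of 'abcd' and 'bc',
so that an undecided tail of the buffer can be held back until the next
chunk arrives. The function name <code>chunked_matches</code>, the tiny
chunk size and the in-memory test data are assumptions for illustration.</p>

```perl
use strict;
use warnings;

# Hypothetical helper: extract /abcd|bc/ matches from a filehandle in
# chunks, holding back any buffer tail that could be the start of a match.
sub chunked_matches {
    my ($fh, $chunksize) = @_;
    my @found;
    my $buffer = '';
    # read() appends at offset length($buffer) and returns 0 at end of file.
    while (read($fh, $buffer, $chunksize, length $buffer)) {
        # Written by hand for this one pattern: the proper prefixes of
        # "abcd" and "bc". $-[0] is the offset where the match started.
        my $keep_from = $buffer =~ /(?:abc|ab|a|b)\z/ ? $-[0] : length $buffer;
        my $head = substr $buffer, 0, $keep_from;
        push @found, $1 while $head =~ /(abcd|bc)/g;
        # Keep only the undecided tail for the next round.
        $buffer = substr $buffer, $keep_from;
    }
    # End of input: nothing more can arrive, so match the tail normally.
    push @found, $1 while $buffer =~ /(abcd|bc)/g;
    return @found;
}

my $data = 'xxabcdxxbcxxabc';
open my $fh, '<', \$data or die "cannot open in-memory file: $!";
print join(',', chunked_matches($fh, 4)), "\n";
```

<p>The prefix pattern has to be kept in sync with the main pattern by
hand, which is exactly the bookkeeping the proposed modifier is meant
to take over.</p>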
<a name='Procedural programming style'></a><h2>Procedural programming style</h2>
<p>As you can see from the example code, the program flow stays very close
to what people would ordinarily program under normal circumstances.</p>
<p>By contrast, RFC 93 proposes another solution to the same problem, but
using callbacks. Since the same sub must do one of several things, the
first thing that needs to be done is to channel different kinds of
requests to their own handler. As a result, you need a complete rewrite
from what you'd use in the ordinary case.</p>
<p>I think that a lot of people will find my approach far less
intimidating.</p>
<a name='Match prefix'></a><h2>Match prefix</h2>
<p>It can be useful to recognize whether a string could be a prefix of a
potential match. For example, in an interactive program you may want to
let a user enter a number into an input field, and nothing else. After
every single keystroke, you can test what has been entered so far
against a regex matching the valid format for a number, so that
<code>1234E</code> can be recognized as a prefix for the regex</p>
<pre>    /^\d+\.?\d*(?:E[+-]?\d+)?$/
    </pre>
<p>I originally had thought of providing a separate, dedicated regex
modifier, just for the match prefix, but I don't think too many people
need this that desperately. You can easily build a working application
with just the '/z' modifier. If you can't, you're in over your head,
anyway.  ;-)</p>
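<p>To show what prefix recognition looks like without the modifier, the
sketch below pairs the number pattern with a second, hand-derived
pattern that accepts every string that can still grow into a valid
number. It assumes the exponent part is meant to be optional; the names
<code>$number</code>, <code>$number_prefix</code> and
<code>classify</code> are made up for this example.</p>

```perl
use strict;
use warnings;

# Full format of a number (exponent assumed optional).
my $number = qr/^\d+\.?\d*(?:E[+-]?\d+)?\z/;

# Hand-derived companion pattern: every trailing part is made optional,
# so it matches any string that is a prefix of some valid number.
my $number_prefix = qr/^(?:\d+\.?\d*(?:E[+-]?\d*)?)?\z/;

# Hypothetical helper: classify what the user has typed so far.
sub classify {
    my ($input) = @_;
    return 'complete' if $input =~ $number;
    return 'prefix'   if $input =~ $number_prefix;
    return 'invalid';
}

for my $typed ('1234', '1234E', '1234E+', '1234E+5', '12x') {
    printf "%-8s %s\n", $typed, classify($typed);
}
```

<p>With the proposed modifier, the second pattern would not be needed:
a single match attempt that leaves pos() defined would signal the
'prefix' case.</p>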
<a name='IMPLEMENTATION'></a><h1>IMPLEMENTATION</h1>
<p>The regex engine must be given the option to abort the search, and
<b>not</b> do any backtracking, as soon as it attempts to do a lookahead to
the next character, but it finds the end of the string instead.</p>
<p>In addition, pos() must be patched so that under this modifier,
it normally is reset to undef(), but when this exception takes place,
pos is set to the offset of the start of the prefix.</p>
<a name='REFERENCES'></a><h1>REFERENCES</h1>
<p>RFC 93: Regex: Support for incremental pattern matching</p>
</div>
